[Feature] Support DiT Layerwise (Blockwise) CPU Offloading #858
Merged: hsliuustc0106 merged 35 commits into vllm-project:main from yuanheng-zhao:feat/layerwise-cpu-offload on Jan 30, 2026
Commits
fdbd01c  layerwise draft  (yuanheng-zhao)
d79975c  draft  (yuanheng-zhao)
2df31a0  draft  (yuanheng-zhao)
62028f9  upd  (yuanheng-zhao)
5331795  apply aggregated flattened tensors  (yuanheng-zhao)
b6036f3  fix offloader on wan2.2  (yuanheng-zhao)
d42d51c  clean up  (yuanheng-zhao)
0a07f86  upd args in t2i, t2v offline examples  (yuanheng-zhao)
cfe3699  apply cls attr to get blocks  (yuanheng-zhao)
3792ea3  upd  (yuanheng-zhao)
b401bef  upd  (yuanheng-zhao)
8163f96  add serve args  (yuanheng-zhao)
8ab72f4  add doc  (yuanheng-zhao)
6e844a2  merge docs  (yuanheng-zhao)
75626f7  Add e2e tests  (yuanheng-zhao)
8b07c12  trivial upd  (yuanheng-zhao)
51d5810  trivial upd  (yuanheng-zhao)
82ee767  Merge branch 'main' into feat/layerwise-cpu-offload  (hsliuustc0106)
9f86fcc  Update vllm_omni/diffusion/offload.py  (hsliuustc0106)
308981d  upd refs  (yuanheng-zhao)
3c4bbc6  fix  (yuanheng-zhao)
e7afde9  fix  (yuanheng-zhao)
2d8257e  fix config words  (yuanheng-zhao)
ba6d840  upd arg name layerwise-num-gpu-layers  (yuanheng-zhao)
99b0d39  upd examples i2i, i2v  (yuanheng-zhao)
0faac2d  merge from main  (yuanheng-zhao)
faa4e2d  upd e2e test  (yuanheng-zhao)
52a13df  merge from main  (yuanheng-zhao)
b3ccb2a  fix wrong replacements  (yuanheng-zhao)
5b0bffe  revise e2e test  (yuanheng-zhao)
c1297e7  upd e2e test  (yuanheng-zhao)
2025760  fix CI (use H100)  (yuanheng-zhao)
4e0e6ac  upd  (yuanheng-zhao)
5155c5e  make cpu offloading test use L4 rather than H100  (yuanheng-zhao)
976bd42  Merge branch 'main' into feat/layerwise-cpu-offload  (yuanheng-zhao)
110 changes: 110 additions & 0 deletions
tests/e2e/offline_inference/test_diffusion_layerwise_offload.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,110 @@ | ||
| import sys | ||
| from pathlib import Path | ||
|
|
||
| import pytest | ||
| import torch | ||
| from vllm.distributed.parallel_state import cleanup_dist_env_and_memory | ||
|
|
||
| from tests.utils import GPUMemoryMonitor | ||
| from vllm_omni.inputs.data import OmniDiffusionSamplingParams | ||
| from vllm_omni.platforms import current_omni_platform | ||
|
|
||
| # ruff: noqa: E402 | ||
| REPO_ROOT = Path(__file__).resolve().parents[2] | ||
| if str(REPO_ROOT) not in sys.path: | ||
| sys.path.insert(0, str(REPO_ROOT)) | ||
|
|
||
| from vllm_omni import Omni | ||
|
|
||
| # Models to test, mapped to the expected memory savings in MB | ||
| MODELS_SAVED_MEMORY_MB = {"riverclouds/qwen_image_random": 4500} | ||
|
|
||
|
|
||
| def run_inference( | ||
| model_name: str, | ||
| layerwise_offload: bool = False, | ||
| num_gpu_layers: int = 1, | ||
| num_inference_steps: int = 3, | ||
| ) -> float: | ||
| # Currently supported on CUDA GPUs only, so torch.cuda operations are used directly here | ||
| # NPU / ROCm platforms are expected to be detected so that this test function is skipped | ||
| torch.cuda.empty_cache() | ||
| device_index = torch.cuda.current_device() | ||
| monitor = GPUMemoryMonitor(device_index=device_index, interval=0.02) | ||
| monitor.start() | ||
|
|
||
| m = Omni( | ||
| model=model_name, | ||
| enable_layerwise_offload=layerwise_offload, | ||
| layerwise_num_gpu_layers=num_gpu_layers, | ||
| boundary_ratio=0.875, | ||
| flow_shift=5.0, | ||
| ) | ||
|
|
||
| torch.cuda.reset_peak_memory_stats(device=device_index) | ||
|
|
||
| # Refer to tests/e2e/offline_inference/test_t2v_model.py | ||
| # Use minimal settings for testing | ||
| height = 480 | ||
| width = 640 | ||
| num_frames = 5 | ||
|
|
||
| m.generate( | ||
| "A cat sitting on a table", | ||
| OmniDiffusionSamplingParams( | ||
| height=height, | ||
| width=width, | ||
| generator=torch.Generator("cuda").manual_seed(42), | ||
| guidance_scale=1.0, | ||
| num_inference_steps=num_inference_steps, | ||
| num_frames=num_frames, | ||
| ), | ||
| ) | ||
|
|
||
|
|
||
| peak = monitor.peak_used_mb | ||
| monitor.stop() | ||
|
|
||
| return peak | ||
|
|
||
|
|
||
| @pytest.mark.skipif(current_omni_platform.is_npu() or current_omni_platform.is_rocm(), reason="Hardware not supported") | ||
| @pytest.mark.parametrize("model_name", MODELS_SAVED_MEMORY_MB.keys()) | ||
| def test_layerwise_offload_diffusion_model(model_name: str): | ||
| """Test that layerwise offloading reduces GPU memory usage. | ||
|
|
||
| This test verifies that layerwise offloading significantly reduces peak | ||
| GPU memory usage compared to loading the entire model on GPU. The layerwise | ||
| offloader keeps only a single transformer block on GPU at a time, with | ||
| prefetching for compute-memory overlap. | ||
| """ | ||
| try: | ||
| # Run without layerwise offloading (baseline) | ||
| no_offload_peak_memory = run_inference(model_name, layerwise_offload=False) | ||
| cleanup_dist_env_and_memory() | ||
|
|
||
| # Run with layerwise offloading (1 layer on device) | ||
| layerwise_offload_peak_memory = run_inference(model_name, layerwise_offload=True, num_gpu_layers=1) | ||
| cleanup_dist_env_and_memory() | ||
|
|
||
| # Run with 2 layers on device | ||
| layerwise_offload_two_layers_peak = run_inference(model_name, layerwise_offload=True, num_gpu_layers=2) | ||
| except Exception as exc: | ||
| pytest.fail(f"Inference failed: {exc}") | ||
|
|
||
| print(f"Layerwise offload peak memory (1 GPU layer): {layerwise_offload_peak_memory} MB") | ||
| print(f"Layerwise offload peak memory (2 GPU layers): {layerwise_offload_two_layers_peak} MB") | ||
| print(f"No offload peak memory: {no_offload_peak_memory} MB") | ||
|
|
||
| # Verify that layerwise offloading significantly reduces memory usage | ||
| # Passes only if the actual savings exceed the expected savings | ||
| assert layerwise_offload_peak_memory + MODELS_SAVED_MEMORY_MB[model_name] < no_offload_peak_memory, ( | ||
| f"Layerwise offload peak memory {layerwise_offload_peak_memory} MB " | ||
| f"should be significantly less than no offload peak memory {no_offload_peak_memory} MB" | ||
| ) | ||
|
|
||
| # Verify that keeping 2 layers on GPU uses more memory than keeping 1 layer | ||
| # (only the ordering is checked; no upper bound on the increase is asserted) | ||
| assert layerwise_offload_peak_memory < layerwise_offload_two_layers_peak, ( | ||
| f"1 GPU layer peak {layerwise_offload_peak_memory} MB should be < " | ||
| f"2 GPU layers peak {layerwise_offload_two_layers_peak} MB" | ||
| ) | ||
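
The ordering asserted by this test (1 GPU layer < 2 GPU layers < no offload) can be reasoned about with a toy model of blockwise offloading. The sketch below is not the offloader from vllm_omni/diffusion/offload.py; it is a minimal pure-Python simulation under stated assumptions: `num_gpu_layers` blocks stay resident, one further block is prefetched before the current block is computed, and the current block is evicted afterwards. The function name `peak_resident_blocks` and the block count of 60 are hypothetical and chosen only for illustration.

```python
def peak_resident_blocks(num_blocks: int, num_gpu_layers: int) -> int:
    """Return the largest number of blocks simultaneously 'on GPU' in one forward pass.

    Toy model (assumption): the offloader keeps `num_gpu_layers` blocks resident,
    prefetches one further block before computing the current one, and evicts
    the current block once it has been computed.
    """
    if num_blocks <= 0 or num_gpu_layers <= 0:
        raise ValueError("num_blocks and num_gpu_layers must be positive")

    # Blocks that are already resident before the forward pass begins.
    resident = set(range(min(num_gpu_layers, num_blocks)))
    peak = len(resident)

    for current in range(num_blocks):
        # Prefetch the next non-resident block so the copy could overlap compute.
        upcoming = current + num_gpu_layers
        if upcoming < num_blocks:
            resident.add(upcoming)
        peak = max(peak, len(resident))

        # "Compute" the current block (a no-op in this sketch), then evict it.
        resident.discard(current)

    return peak


if __name__ == "__main__":
    total_blocks = 60  # hypothetical block count, for illustration only
    one_layer = peak_resident_blocks(total_blocks, num_gpu_layers=1)
    two_layers = peak_resident_blocks(total_blocks, num_gpu_layers=2)
    no_offload = peak_resident_blocks(total_blocks, num_gpu_layers=total_blocks)
    # Same ordering the test asserts on measured peak memory.
    assert one_layer < two_layers < no_offload
    print(one_layer, two_layers, no_offload)
```

In this model the peak grows by one block for each extra resident layer, which is why the test checks only a strict ordering between the 1-layer and 2-layer runs, while the comparison against the no-offload baseline uses a model-specific margin (`MODELS_SAVED_MEMORY_MB`). Real peak memory also includes activations, the text encoder, and the VAE, which this sketch ignores.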