Merged
35 commits
fdbd01c
layerwise draft
yuanheng-zhao Jan 14, 2026
d79975c
draft
yuanheng-zhao Jan 16, 2026
2df31a0
draft
yuanheng-zhao Jan 19, 2026
62028f9
upd
yuanheng-zhao Jan 20, 2026
5331795
apply aggregated flattened tensors
yuanheng-zhao Jan 21, 2026
b6036f3
fix offloader on wan2.2
yuanheng-zhao Jan 26, 2026
d42d51c
clean up
yuanheng-zhao Jan 26, 2026
0a07f86
upd args in t2i, t2v offline examples
yuanheng-zhao Jan 26, 2026
cfe3699
apply cls attr to get blocks
yuanheng-zhao Jan 26, 2026
3792ea3
upd
yuanheng-zhao Jan 26, 2026
b401bef
upd
yuanheng-zhao Jan 27, 2026
8163f96
add serve args
yuanheng-zhao Jan 27, 2026
8ab72f4
add doc
yuanheng-zhao Jan 27, 2026
6e844a2
merge docs
yuanheng-zhao Jan 27, 2026
75626f7
Add e2e tests
yuanheng-zhao Jan 27, 2026
8b07c12
trivial upd
yuanheng-zhao Jan 27, 2026
51d5810
trivial upd
yuanheng-zhao Jan 27, 2026
82ee767
Merge branch 'main' into feat/layerwise-cpu-offload
hsliuustc0106 Jan 28, 2026
9f86fcc
Update vllm_omni/diffusion/offload.py
hsliuustc0106 Jan 28, 2026
308981d
upd refs
yuanheng-zhao Jan 28, 2026
3c4bbc6
fix
yuanheng-zhao Jan 28, 2026
e7afde9
fix
yuanheng-zhao Jan 28, 2026
2d8257e
fix config words
yuanheng-zhao Jan 28, 2026
ba6d840
upd arg name layerwise-num-gpu-layers
yuanheng-zhao Jan 29, 2026
99b0d39
upd examples i2i, i2v
yuanheng-zhao Jan 29, 2026
0faac2d
merge from main
yuanheng-zhao Jan 29, 2026
faa4e2d
upd e2e test
yuanheng-zhao Jan 29, 2026
52a13df
merge from main
yuanheng-zhao Jan 29, 2026
b3ccb2a
fix wrong replacements
yuanheng-zhao Jan 29, 2026
5b0bffe
revise e2e test
yuanheng-zhao Jan 29, 2026
c1297e7
upd e2e test
yuanheng-zhao Jan 29, 2026
2025760
fix CI (use H100)
yuanheng-zhao Jan 30, 2026
4e0e6ac
upd
yuanheng-zhao Jan 30, 2026
5155c5e
make cpu offloading test use L4 rather than H100
yuanheng-zhao Jan 30, 2026
976bd42
Merge branch 'main' into feat/layerwise-cpu-offload
yuanheng-zhao Jan 30, 2026
1 change: 1 addition & 0 deletions .buildkite/pipeline.yml
@@ -77,6 +77,7 @@ steps:
    depends_on: image-build
    commands:
      - pytest -s -v tests/e2e/offline_inference/test_diffusion_cpu_offload.py
      - pytest -s -v tests/e2e/offline_inference/test_diffusion_layerwise_offload.py
    agents:
      queue: "gpu_1_queue" # g6.4xlarge instance on AWS, has 1 L4 GPU
    plugins:
82 changes: 77 additions & 5 deletions docs/user_guide/diffusion/cpu_offload_diffusion.md
@@ -1,19 +1,32 @@
# CPU Offloading for Diffusion Model

## Overview

vLLM-Omni provides two offloading strategies to reduce GPU memory usage for diffusion models, allowing you to run larger models on GPUs with limited VRAM:

1. **Model-level (Component) Offloading**: Swaps entire model components (DiT transformer, VAE, encoders) between GPU and CPU.
2. **Layerwise (Blockwise) Offloading**: Keeps only one or a few transformer blocks on GPU at a time and overlaps computation with memory copies.

Both approaches use pinned memory for faster CPU-GPU transfers. The two strategies cannot currently be enabled at the same time.


## Model-level CPU Offloading

### Implementation

CPU offload lets the diffusion worker move large model components between GPU and CPU memory on demand. It keeps the DiT transformer resident on GPU only while it is actively running, and swaps it out when the encoder modules need the device. This reduces peak VRAM usage so that larger checkpoints can run on smaller GPUs, or multiple requests can share the same GPU.

## Execution Model
**Execution Flow**:
1. Text encoders run on GPU while the DiT transformer is offloaded to CPU.
2. Before denoising, weights are prefetched back to GPU, using pinned-memory copies for speed.
3. After the diffusion step, the transformer returns to CPU and the process repeats as needed.

Transfers use pinned host buffers, and the worker coordinates swaps via mutex-style hooks so components never compete for memory.
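
The mutually exclusive residency described above can be illustrated with a small pure-Python simulation. This is only a sketch under assumptions, not the worker's actual code: `ComponentSwapper`, `before_forward`, and the string device labels are hypothetical names, and a real transfer would go through pinned host buffers rather than a dictionary update.

```python
class ComponentSwapper:
    """Toy model of model-level offload: only one group is on the GPU at a time."""

    def __init__(self, groups):
        # groups maps a group name (e.g. "encoders", "dit") to its component names.
        self.groups = groups
        # Every component starts on the CPU.
        self.device = {name: "cpu" for names in groups.values() for name in names}
        self.log = []

    def before_forward(self, group):
        """Hypothetical pre-forward hook: evict other groups, then load this one."""
        for other, names in self.groups.items():
            if other == group:
                continue
            for name in names:
                if self.device[name] == "gpu":
                    self.device[name] = "cpu"
                    self.log.append(f"offload {name}")
        for name in self.groups[group]:
            if self.device[name] == "cpu":
                self.device[name] = "gpu"
                self.log.append(f"load {name}")

    def on_gpu(self):
        """Return the names of the components currently marked as GPU-resident."""
        return sorted(name for name, dev in self.device.items() if dev == "gpu")


swapper = ComponentSwapper({"encoders": ["text_encoder"], "dit": ["transformer"]})
swapper.before_forward("encoders")  # prompt encoding: encoders take the device
swapper.before_forward("dit")       # denoising: the DiT replaces the encoders
print(swapper.on_gpu())
print(swapper.log)
```

Because eviction always happens before loading, the two groups never hold GPU memory at the same time in this model, which is the property the execution flow relies on.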

## Configuration
### Configuration
You can enable CPU offload in two ways:

- **Python API**: set `enable_cpu_offload=True`.
1. **Python API**: set `enable_cpu_offload=True`.

```python
from vllm_omni import Omni
@@ -23,7 +36,66 @@ if __name__ == "__main__":
    m = Omni(model="Qwen/Qwen-Image", enable_cpu_offload=True)
```

- **CLI**: pass `--enable-cpu-offload` to the diffusion service entrypoint.
2. **CLI**: pass `--enable-cpu-offload` to the diffusion service entrypoint.

## Known Limitations
### Limitations
- Cold-start latency can increase by more than one minute for some models (e.g., Qwen-Image).


## Layerwise (Blockwise) Offloading

### Implementation
Layerwise offload operates at transformer block granularity, keeping a single transformer block, or a specified number of blocks, on GPU while others stay in CPU memory.

Unlike model-level CPU offload, which swaps entire components such as the DiT and encoders, layerwise offloading moves weights between GPU and CPU in a sliding window: while block `i` computes, block `i+1` is prefetched asynchronously from pinned memory. Only a subset of the blocks resides on GPU at any moment during inference, which greatly reduces GPU memory usage.

**Execution Flow**:

1. During model initialization, all components are first loaded to CPU. Components other than the DiT model(s) in the pipeline, such as the VAE and encoders, are then moved to GPU. The weights of the target transformer blocks are collected into contiguous per-layer tensors in pinned CPU memory, while the non-block modules of the DiT model (embeddings, norms, etc.) are moved to GPU and stay there.
2. The first block(s) are transferred to GPU when `LayerwiseOffloader` is initialized, before the first denoising step of the very first request.
3. As each block executes, the next block is prefetched on a separate CUDA stream so that the copy overlaps with computation. After execution, the current block's GPU memory is freed immediately.
4. When the last block completes, the first block is prefetched for the next denoising step.
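
Step 1 mentions collecting each block's weights into a contiguous tensor. The sketch below shows the general idea with the standard-library `array` type standing in for a tensor; `flatten_block` and `read_param` are hypothetical helpers, and the motivation given in the comments (one large copy per block instead of many small ones) is an assumption rather than something the PR states.

```python
from array import array


def flatten_block(params):
    """Pack one block's parameters into a single contiguous buffer.

    params maps a parameter name to a list of floats (a stand-in for a tensor).
    Returns the flat buffer and an index of {name: (offset, length)}.
    """
    flat = array("f")
    index = {}
    for name, values in params.items():
        index[name] = (len(flat), len(values))  # where this parameter starts
        flat.extend(values)
    return flat, index


def read_param(flat, index, name):
    """Recover one parameter from the flat buffer using its recorded offset."""
    offset, length = index[name]
    # Slicing an array copies; a tensor library would typically return a view.
    return list(flat[offset:offset + length])


# Assumed benefit: the whole block can move in one transfer of len(flat) elements.
block_params = {
    "attn.weight": [1.0, 2.0, 3.0],
    "attn.bias": [0.5],
    "mlp.weight": [4.0, 5.0],
}
flat, index = flatten_block(block_params)
print(len(flat), index)
```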


The table below shows the hook execution order for a DiT model with n layers when a single layer is kept on GPU (the default):

| Layer (block) idx | forward pre-hook | forward | forward post-hook |
|-------------------|--------------------------------|------------------|---------------------------|
| layer 0 | prefetch layer 1 (copy stream) | compute layer 0 | free layer 0 GPU weights |
| layer 1 | prefetch layer 2 (copy stream) | compute layer 1 | free layer 1 GPU weights |
| layer 2 | prefetch layer 3 (copy stream) | compute layer 2 | free layer 2 GPU weights |
| ... | ... | ... | ... |
| layer n-1 | **prefetch layer 0 (copy stream)** | compute layer n-1 | free layer n-1 GPU weights |
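
The hook order in the table can be replayed with a short simulation that tracks which blocks are resident. It is a sketch that assumes the default of one ready block and treats a prefetch as completing instantly; `simulate_denoising_step` is a hypothetical name, and no real CUDA streams are involved.

```python
def simulate_denoising_step(num_layers, resident=(0,)):
    """Replay pre-hook / forward / post-hook for every block in one step.

    Returns (events, peak_resident, resident_after). Block 0 is assumed to be
    preloaded, as described in step 2 of the execution flow.
    """
    if num_layers < 2:
        raise ValueError("the sliding window needs at least two blocks")
    on_gpu = set(resident)
    events = []
    peak = len(on_gpu)
    for i in range(num_layers):
        nxt = (i + 1) % num_layers  # wraps to block 0 after the last block
        # forward pre-hook: start copying the next block (modelled as instant)
        on_gpu.add(nxt)
        events.append(("prefetch", nxt))
        peak = max(peak, len(on_gpu))
        # forward: the current block must already be resident
        if i not in on_gpu:
            raise RuntimeError(f"block {i} was never prefetched")
        events.append(("compute", i))
        # forward post-hook: release the current block's GPU weights
        on_gpu.discard(i)
        events.append(("free", i))
    return events, peak, on_gpu


events, peak, resident_after = simulate_denoising_step(num_layers=4)
for event in events:
    print(event)
print("peak resident blocks:", peak)
```

In this model the peak is two blocks (the one computing plus the one being prefetched), and the step ends with block 0 resident again, ready for the next denoising step.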


### Configuration

1. **Python API**: set `enable_layerwise_offload=True` and optionally `layerwise_num_gpu_layers`.

```python
from vllm_omni import Omni

if __name__ == "__main__":
    m = Omni(
        model="Wan-AI/Wan2.2-T2V-A14B-Diffusers",
        enable_layerwise_offload=True,
        ...
    )
```

2. **CLI**: pass `--enable-layerwise-offload` and `--layerwise-num-gpu-layers` to the diffusion service entrypoint.
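
As a rough illustration of how the flags fit together, the sketch below rebuilds the three offload flags with `argparse`, using the same names and defaults as the example scripts in this PR. The validation in `parse_offload_args` is an assumption: it mirrors the documented rule that the two strategies cannot be combined, but the PR does not show where (or whether) that rule is enforced.

```python
import argparse


def build_parser():
    """Recreate the offload flags as they appear in the example scripts."""
    parser = argparse.ArgumentParser(description="Offload flags (illustrative only)")
    parser.add_argument(
        "--enable-cpu-offload",
        action="store_true",
        help="Enable CPU offloading for diffusion models.",
    )
    parser.add_argument(
        "--enable-layerwise-offload",
        action="store_true",
        help="Enable layerwise (blockwise) offloading on DiT modules.",
    )
    parser.add_argument(
        "--layerwise-num-gpu-layers",
        type=int,
        default=1,
        help="Number of ready layers (blocks) to keep on GPU during generation.",
    )
    return parser


def parse_offload_args(argv):
    """Parse argv and apply hypothetical sanity checks on the combination."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed check: the docs say the two strategies cannot be used together.
    if args.enable_cpu_offload and args.enable_layerwise_offload:
        parser.error("choose either model-level or layerwise offloading, not both")
    # Assumed check: at least one block has to be ready on the GPU.
    if args.layerwise_num_gpu_layers < 1:
        parser.error("--layerwise-num-gpu-layers must be at least 1")
    return args


args = parse_offload_args(["--enable-layerwise-offload", "--layerwise-num-gpu-layers", "2"])
print(args.enable_layerwise_offload, args.layerwise_num_gpu_layers)
```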

### Supported Models

| Architecture | Models | Example HF Models | DiT Model Cls | Blocks Attr Name |
|--------------|--------|-------------------|----------|----------|
| `QwenImagePipeline` | Qwen-Image-Edit | `Qwen/Qwen-Image` | `QwenImageTransformer2DModel` | "transformer_blocks" |
| `Wan22Pipeline` | Wan2.2 | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | `WanTransformer3DModel` | "blocks" |

NOTE: A model must define the `_layerwise_offload_blocks_attr` class attribute so that the layerwise offloader can find the target transformer blocks.
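
A lookup driven by this class attribute could work roughly as follows. This is a sketch, not the offloader's real code: `find_offload_blocks`, `ToyTransformer`, and `UnsupportedModel` are hypothetical, and only the attribute name `_layerwise_offload_blocks_attr` comes from the PR.

```python
class ToyTransformer:
    """Stand-in for a DiT class that opts in to layerwise offloading."""

    _layerwise_offload_blocks_attr = "blocks"

    def __init__(self, num_blocks):
        # Strings stand in for transformer block modules.
        self.blocks = [f"block-{i}" for i in range(num_blocks)]


class UnsupportedModel:
    """Stand-in for a model that does not declare the attribute."""


def find_offload_blocks(model):
    """Return the model's transformer blocks, located via the class attribute."""
    attr_name = getattr(type(model), "_layerwise_offload_blocks_attr", None)
    if attr_name is None:
        raise ValueError(
            f"{type(model).__name__} does not define _layerwise_offload_blocks_attr"
        )
    blocks = getattr(model, attr_name, None)
    if blocks is None:
        raise AttributeError(f"{type(model).__name__} has no attribute {attr_name!r}")
    return list(blocks)


print(find_offload_blocks(ToyTransformer(num_blocks=3)))
```

Declaring the attribute name on the class keeps the offloader independent of each architecture's naming, which is why the two supported models can use `"transformer_blocks"` and `"blocks"` respectively.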

### Limitations
- Cold-start latency increases because:
    1. components are first loaded to CPU during initialization, and
    2. block weights are consolidated and pinned.
- Performance depends on the CPU <-> GPU interconnect (e.g., PCIe bandwidth).
- Only a single GPU is supported for now.
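
The dependence on interconnect bandwidth can be made concrete with a back-of-the-envelope model: a prefetch is hidden only if copying the next block takes no longer than computing the current one. The numbers below are hypothetical and chosen purely for easy arithmetic; they are not measurements of any model or GPU, and the helper names are illustrative.

```python
def copy_time_s(block_bytes, bandwidth_bytes_per_s):
    """Time to move one block over the host-device link, ignoring latency."""
    return block_bytes / bandwidth_bytes_per_s


def stall_per_block_s(block_bytes, bandwidth_bytes_per_s, compute_s):
    """Time the GPU waits for the next block when the copy outlasts the compute."""
    return max(0.0, copy_time_s(block_bytes, bandwidth_bytes_per_s) - compute_s)


# Hypothetical values, not measurements.
BLOCK_BYTES = 400e6     # 400 MB of weights per block
LINK_BANDWIDTH = 20e9   # 20 GB/s effective host-to-device bandwidth
COMPUTE_S = 0.030       # 30 ms of compute per block

hidden = stall_per_block_s(BLOCK_BYTES, LINK_BANDWIDTH, COMPUTE_S)
exposed = stall_per_block_s(BLOCK_BYTES, LINK_BANDWIDTH / 4, COMPUTE_S)
print(f"full bandwidth: stall = {hidden:.3f} s per block")
print(f"quarter bandwidth: stall = {exposed:.3f} s per block")
```

Under these assumptions the copy is fully overlapped at the higher bandwidth, while at a quarter of it every block adds a visible stall, so total step time grows with the number of blocks.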
13 changes: 13 additions & 0 deletions examples/offline_inference/image_to_image/image_edit.py
@@ -295,6 +295,17 @@ def parse_args() -> argparse.Namespace:
        action="store_true",
        help="Enable CPU offloading for diffusion models.",
    )
    parser.add_argument(
        "--enable-layerwise-offload",
        action="store_true",
        help="Enable layerwise (blockwise) offloading on DiT modules.",
    )
    parser.add_argument(
        "--layerwise-num-gpu-layers",
        type=int,
        default=1,
        help="Number of ready layers (blocks) to keep on GPU during generation.",
    )
    return parser.parse_args()


@@ -350,6 +361,8 @@ def main():
    # Initialize Omni with appropriate pipeline
    omni = Omni(
        model=args.model,
        enable_layerwise_offload=args.enable_layerwise_offload,
        layerwise_num_gpu_layers=args.layerwise_num_gpu_layers,
        vae_use_slicing=args.vae_use_slicing,
        vae_use_tiling=args.vae_use_tiling,
        cache_backend=args.cache_backend,
13 changes: 13 additions & 0 deletions examples/offline_inference/image_to_video/image_to_video.py
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,17 @@ def parse_args() -> argparse.Namespace:
action="store_true",
help="Enable CPU offloading for diffusion models.",
)
parser.add_argument(
"--enable-layerwise-offload",
action="store_true",
help="Enable layerwise (blockwise) offloading on DiT modules.",
)
parser.add_argument(
"--layerwise-num-gpu-layers",
type=int,
default=1,
help="Number of ready layers (blocks) to keep on GPU during generation.",
)
return parser.parse_args()


Expand Down Expand Up @@ -112,6 +123,8 @@ def main():

omni = Omni(
model=args.model,
enable_layerwise_offload=args.enable_layerwise_offload,
layerwise_num_gpu_layers=args.layerwise_num_gpu_layers,
vae_use_slicing=args.vae_use_slicing,
vae_use_tiling=args.vae_use_tiling,
boundary_ratio=args.boundary_ratio,
Expand Down
13 changes: 13 additions & 0 deletions examples/offline_inference/text_to_image/text_to_image.py
@@ -107,6 +107,17 @@ def parse_args() -> argparse.Namespace:
        action="store_true",
        help="Enable CPU offloading for diffusion models.",
    )
    parser.add_argument(
        "--enable-layerwise-offload",
        action="store_true",
        help="Enable layerwise (blockwise) offloading on DiT modules.",
    )
    parser.add_argument(
        "--layerwise-num-gpu-layers",
        type=int,
        default=1,
        help="Number of ready layers (blocks) to keep on GPU during generation.",
    )
    parser.add_argument(
        "--tensor_parallel_size",
        type=int,
@@ -172,6 +183,8 @@ def main():

    omni = Omni(
        model=args.model,
        enable_layerwise_offload=args.enable_layerwise_offload,
        layerwise_num_gpu_layers=args.layerwise_num_gpu_layers,
        vae_use_slicing=args.vae_use_slicing,
        vae_use_tiling=args.vae_use_tiling,
        cache_backend=args.cache_backend,
13 changes: 13 additions & 0 deletions examples/offline_inference/text_to_video/text_to_video.py
Original file line number Diff line number Diff line change
Expand Up @@ -79,6 +79,17 @@ def parse_args() -> argparse.Namespace:
action="store_true",
help="Enable CPU offloading for diffusion models.",
)
parser.add_argument(
"--enable-layerwise-offload",
action="store_true",
help="Enable layerwise (blockwise) offloading on DiT modules.",
)
parser.add_argument(
"--layerwise-num-gpu-layers",
type=int,
default=1,
help="Number of ready layers (blocks) to keep on GPU during generation.",
)
parser.add_argument(
"--ulysses_degree",
type=int,
Expand Down Expand Up @@ -128,6 +139,8 @@ def main():

omni = Omni(
model=args.model,
enable_layerwise_offload=args.enable_layerwise_offload,
layerwise_num_gpu_layers=args.layerwise_num_gpu_layers,
vae_use_slicing=args.vae_use_slicing,
vae_use_tiling=args.vae_use_tiling,
boundary_ratio=args.boundary_ratio,
Expand Down
110 changes: 110 additions & 0 deletions tests/e2e/offline_inference/test_diffusion_layerwise_offload.py
@@ -0,0 +1,110 @@
import sys
from pathlib import Path

import pytest
import torch
from vllm.distributed.parallel_state import cleanup_dist_env_and_memory

from tests.utils import GPUMemoryMonitor
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.platforms import current_omni_platform

# ruff: noqa: E402
REPO_ROOT = Path(__file__).resolve().parents[2]
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

from vllm_omni import Omni

# Models to test and expected saved memory in MB, correspondingly
MODELS_SAVED_MEMORY_MB = {"riverclouds/qwen_image_random": 4500}


def run_inference(
    model_name: str,
    layerwise_offload: bool = False,
    num_gpu_layers: int = 1,
    num_inference_steps: int = 3,
) -> float:
    # For now, only support on GPU, so apply torch.cuda operations here
    # NPU / ROCm platforms are expected to be detected and skipped this test function
    torch.cuda.empty_cache()
    device_index = torch.cuda.current_device()
    monitor = GPUMemoryMonitor(device_index=device_index, interval=0.02)
    monitor.start()

    m = Omni(
        model=model_name,
        enable_layerwise_offload=layerwise_offload,
        layerwise_num_gpu_layers=num_gpu_layers,
        boundary_ratio=0.875,
        flow_shift=5.0,
    )

    torch.cuda.reset_peak_memory_stats(device=device_index)

    # Refer to tests/e2e/offline_inference/test_t2v_model.py
    # Use minimal settings for testing
    height = 480
    width = 640
    num_frames = 5

    m.generate(
        "A cat sitting on a table",
        OmniDiffusionSamplingParams(
            height=height,
            width=width,
            generator=torch.Generator("cuda").manual_seed(42),
            guidance_scale=1.0,
            num_inference_steps=num_inference_steps,
            num_frames=num_frames,
        ),
    )

    peak = monitor.peak_used_mb
    monitor.stop()

    return peak


@pytest.mark.skipif(current_omni_platform.is_npu() or current_omni_platform.is_rocm(), reason="Hardware not supported")
@pytest.mark.parametrize("model_name", MODELS_SAVED_MEMORY_MB.keys())
def test_layerwise_offload_diffusion_model(model_name: str):
    """Test that layerwise offloading reduces GPU memory usage.

    This test verifies that layerwise offloading significantly reduces peak
    GPU memory usage compared to loading the entire model on GPU. The layerwise
    offloader keeps only a single transformer block on GPU at a time, with
    prefetching for compute-memory overlap.
    """
    try:
        # Run without layerwise offloading (baseline)
        no_offload_peak_memory = run_inference(model_name, layerwise_offload=False)
        cleanup_dist_env_and_memory()

        # Run with layerwise offloading (1 layer on device)
        layerwise_offload_peak_memory = run_inference(model_name, layerwise_offload=True, num_gpu_layers=1)
        cleanup_dist_env_and_memory()

        # Run with 2 layers on device
        layerwise_offload_two_layers_peak = run_inference(model_name, layerwise_offload=True, num_gpu_layers=2)
    except Exception:
        pytest.fail("Inference failed")

    print(f"Layerwise offload peak memory (1 GPU layer): {layerwise_offload_peak_memory} MB")
    print(f"Layerwise offload peak memory (2 GPU layers): {layerwise_offload_two_layers_peak} MB")
    print(f"No offload peak memory: {no_offload_peak_memory} MB")

    # Verify that layerwise offloading significantly reduces memory usage
    # Passes only if the actual savings exceeds the expected savings
    assert layerwise_offload_peak_memory + MODELS_SAVED_MEMORY_MB[model_name] < no_offload_peak_memory, (
        f"Layerwise offload peak memory {layerwise_offload_peak_memory} MB "
        f"should be significantly less than no offload peak memory {no_offload_peak_memory} MB"
    )

    # Verify that 2 GPU layers uses more memory than 1 GPU layer
    # But not excessively more (should be a reasonable increase)
    assert layerwise_offload_peak_memory < layerwise_offload_two_layers_peak, (
        f"1 GPU layer peak {layerwise_offload_peak_memory} MB should be < "
        f"2 GPU layers peak {layerwise_offload_two_layers_peak} MB"
    )
6 changes: 6 additions & 0 deletions vllm_omni/diffusion/data.py
@@ -288,6 +288,12 @@ class OmniDiffusionConfig:
    # - Text encoders run on GPU while DiT is on CPU
    # - DiT runs on GPU while encoders are on CPU
    enable_cpu_offload: bool = False

    # Layer-wise offloading (block-level offloading) parameters
    enable_layerwise_offload: bool = False
    # Number of transformer blocks ready for computation to keep on GPU
    layerwise_num_gpu_layers: int = 1

    use_fsdp_inference: bool = False
    pin_cpu_memory: bool = True  # Use pinned memory for faster transfers when offloading

@@ -758,6 +758,7 @@ class QwenImageTransformer2DModel(CachedTransformer):
    # -- typically a transformer layer
    # used for torch compile optimizations
    _repeated_blocks = ["QwenImageTransformerBlock"]
    _layerwise_offload_blocks_attr = "transformer_blocks"
    packed_modules_mapping = {
        "to_qkv": ["to_q", "to_k", "to_v"],
        "add_kv_proj": ["add_q_proj", "add_k_proj", "add_v_proj"],
1 change: 1 addition & 0 deletions vllm_omni/diffusion/models/wan2_2/wan2_2_transformer.py
@@ -531,6 +531,7 @@ class WanTransformer3DModel(nn.Module):
    """

    _repeated_blocks = ["WanTransformerBlock"]
    _layerwise_offload_blocks_attr = "blocks"
    packed_modules_mapping = {
        "to_qkv": ["to_q", "to_k", "to_v"],
    }