vllm-project · hsliuustc0106 · Jan 30, 2026 · Jan 14, 2026 · Jan 16, 2026 · Jan 19, 2026
@@ -75,6 +75,7 @@ steps:
     depends_on: image-build
     commands:
       - pytest -s -v tests/e2e/offline_inference/test_diffusion_cpu_offload.py
+      - pytest -s -v tests/e2e/offline_inference/test_diffusion_layerwise_offload.py
     agents:
       queue: "gpu_1_queue" # g6.4xlarge instance on AWS, has 1 L4 GPU
     plugins:

@@ -1,19 +1,32 @@
 # CPU Offloading for Diffusion Model
 
 ## Overview
+
+vLLM-Omni provides two offloading strategies to reduce GPU memory usage for diffusion models, allowing you to run larger models on GPUs with limited VRAM:
+
+1. **Model-level (Component) Offloading**: Swaps entire model components (DiT transformer, VAE, encoders) between GPU and CPU.
+2. **Layerwise (Blockwise) Offloading**: Keeps only one or a few transformer blocks on GPU at a time, overlapping computation with memory copies.
+
+Both approaches use pinned memory for faster CPU-GPU transfers. The two offloading strategies cannot currently be enabled at the same time.
+
+
+## Model-level CPU Offloading
+
+### Implementation
+
 CPU offload lets the diffusion worker move large model components between GPU and CPU memory on demand. It keeps the DiT transformer resident on GPU only while it is actively running, and swaps it out when encoders modules need the device. This reduces peak VRAM usage so bigger checkpoints run on smaller GPUs, or multiple requests can share the same GPU.
 
-## Execution Model
+**Execution Flow**:
 1. Text encoders run on GPU while the DiT transformer is offloaded to CPU.
 2. Before denoising, weights are prefetched back to GPU, honoring pinned-memory copies for speed.
 3. After the diffusion step, the transformer returns to CPU and the process repeats as needed.
 
 Transfers use pinned host buffers, and the worker coordinates swaps via mutex-style hooks so components never compete for memory.
 
-## Configuration
+### Configuration
 You can enable CPU offload in two ways:
 
-- **Python API**: set `enable_cpu_offload=True`.
+1. **Python API**: set `enable_cpu_offload=True`.
 
 ```python
 from vllm_omni import Omni
@@ -23,7 +36,66 @@ if __name__ == "__main__":
     m = Omni(model="Qwen/Qwen-Image",enable_cpu_offload=True)
 ```
 
-- **CLI**: pass `--enable-cpu-offload` to the diffusion service entrypoint.
+2. **CLI**: pass `--enable-cpu-offload` to the diffusion service entrypoint.
 
-## Known Limitations
+### Limitations
 - Cold start latency increases for over one minute for some models(e.g., Qwen-Image)
+
+
+## Layerwise (Blockwise) Offloading
+
+### Implementation
+Layerwise offload operates at transformer block granularity, keeping a single transformer block, or a specified number of blocks, on GPU while others stay in CPU memory.
+
+Unlike model-level CPU offload, which swaps entire components such as the DiT and encoders, layerwise offloading moves weights between GPU and CPU in a sliding window: while block `i` computes, block `i+1` is prefetched asynchronously from pinned memory. Only a subset of the blocks resides on GPU at any moment during inference, which greatly reduces GPU memory usage.
+
+**Execution Flow**:
+
+1. During model initialization, all components are first loaded to CPU. Components other than the DiT model(s) in the pipeline, such as the VAE and encoders, are then moved to GPU. The weights of the target transformer blocks are collected per layer into contiguous pinned-memory tensors on CPU, while non-block modules of the DiT model (embeddings, norms, etc.) are moved to GPU and stay there.
+2. The first block(s) are transferred to GPU during initialization of `LayerwiseOffloader`, before the first denoising step of the very first request.
+3. As each block executes, the next block is prefetched on a separate CUDA stream so that computation overlaps with memory copies. After execution, the current block is immediately freed from GPU memory.
+4. While the last block executes, the first block is prefetched for the next denoising step.
+
+
+Example hook execution for a DiT model with n layers, keeping a single layer on GPU (the default):
+| Layer (block) idx | forward pre-hook                   | forward             | forward post-hook            |
+|-------------------|------------------------------------|---------------------|------------------------------|
+| layer-0           | prefetch layer-1 (copy stream)     | compute layer-0     | free layer-0 gpu weights     |
+| layer-1           | prefetch layer-2 (copy stream)     | compute layer-1     | free layer-1 gpu weights     |
+| layer-2           | prefetch layer-3 (copy stream)     | compute layer-2     | free layer-2 gpu weights     |
+| ...               | ...                                | ...                 | ...                          |
+| layer-(n-1)       | **prefetch layer-0 (copy stream)** | compute layer-(n-1) | free layer-(n-1) gpu weights |
+
+
+### Configuration
+
+1. **Python API**: set `enable_layerwise_offload=True` and optionally `layerwise_num_gpu_layers`.
+
+```python
+from vllm_omni import Omni
+
+if __name__ == "__main__":
+    m = Omni(
+        model="Wan-AI/Wan2.2-T2V-A14B-Diffusers",
+        enable_layerwise_offload=True,
+        ...
+    )
+```
+
+2. **CLI**: pass `--enable-layerwise-offload` and, optionally, `--layerwise-num-gpu-layers` (default: 1) to the diffusion service entrypoint.
+
+### Supported Models
+
+| Architecture | Models | Example HF Models | DiT Model Class | Blocks Attribute Name |
+|--------------|--------|-------------------|-----------------|-----------------------|
+| `QwenImagePipeline` | Qwen-Image-Edit | `Qwen/Qwen-Image` | `QwenImageTransformer2DModel` | `transformer_blocks` |
+| `Wan22Pipeline` | Wan2.2 | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | `WanTransformer3DModel` | `blocks` |
+
+NOTE: Models must define the `_layerwise_offload_blocks_attr` class attribute so that the layerwise offloader can find the target transformer blocks.
+
+### Limitations
+- Cold start latency increases because:
+    1. all components are loaded to CPU first during initialization, and
+    2. block weights are consolidated and pinned.
+- Performance depends on the CPU <-> GPU interconnect (e.g., PCIe bandwidth).
+- Only single-GPU execution is supported for now.
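
The hook schedule described in the layerwise section can be checked with a small, framework-free simulation. The sketch below is not the `LayerwiseOffloader` implementation: `HookTrace` and `simulate_layerwise_offload` are hypothetical names, and the generalization to `num_gpu_layers > 1` (prefetching the block that is `num_gpu_layers` positions ahead) is an assumption. It only tracks which block indices would be resident on GPU at each hook.

```python
from dataclasses import dataclass, field


@dataclass
class HookTrace:
    """Records which block indices are resident on the simulated GPU."""

    resident: set[int] = field(default_factory=set)
    peak_resident: int = 0
    events: list[str] = field(default_factory=list)

    def prefetch(self, idx: int) -> None:
        # Stands in for an async host-to-device copy on a separate stream.
        self.resident.add(idx)
        self.peak_resident = max(self.peak_resident, len(self.resident))
        self.events.append(f"prefetch {idx}")

    def free(self, idx: int) -> None:
        # Stands in for releasing the block's GPU weights.
        self.resident.discard(idx)
        self.events.append(f"free {idx}")


def simulate_layerwise_offload(
    num_blocks: int, num_gpu_layers: int = 1, num_steps: int = 1
) -> HookTrace:
    if not 1 <= num_gpu_layers < num_blocks:
        raise ValueError("expected 1 <= num_gpu_layers < num_blocks")

    trace = HookTrace()
    # Initialization: the first block(s) are copied before the first step.
    for idx in range(num_gpu_layers):
        trace.prefetch(idx)

    for _ in range(num_steps):
        for i in range(num_blocks):
            # Pre-hook: prefetch the block ahead, wrapping around to block 0
            # so the next denoising step starts with its weights in place.
            trace.prefetch((i + num_gpu_layers) % num_blocks)
            # Forward: the current block must already be resident.
            assert i in trace.resident
            trace.events.append(f"compute {i}")
            # Post-hook: free the block that just ran.
            trace.free(i)
    return trace
```

Under these assumptions, at most `num_gpu_layers + 1` blocks are resident at once (the ready blocks plus the one being prefetched), compared with all `n` blocks without offloading.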
@@ -295,6 +295,17 @@ def parse_args() -> argparse.Namespace:
         action="store_true",
         help="Enable CPU offloading for diffusion models.",
     )
+    parser.add_argument(
+        "--enable-layerwise-offload",
+        action="store_true",
+        help="Enable layerwise (blockwise) offloading on DiT modules.",
+    )
+    parser.add_argument(
+        "--layerwise-num-gpu-layers",
+        type=int,
+        default=1,
+        help="Number of ready layers (blocks) to keep on GPU during generation.",
+    )
     return parser.parse_args()
 
 
@@ -350,6 +361,8 @@ def main():
     # Initialize Omni with appropriate pipeline
     omni = Omni(
         model=args.model,
+        enable_layerwise_offload=args.enable_layerwise_offload,
+        layerwise_num_gpu_layers=args.layerwise_num_gpu_layers,
         vae_use_slicing=args.vae_use_slicing,
         vae_use_tiling=args.vae_use_tiling,
         cache_backend=args.cache_backend,

@@ -74,6 +74,17 @@ def parse_args() -> argparse.Namespace:
         action="store_true",
         help="Enable CPU offloading for diffusion models.",
     )
+    parser.add_argument(
+        "--enable-layerwise-offload",
+        action="store_true",
+        help="Enable layerwise (blockwise) offloading on DiT modules.",
+    )
+    parser.add_argument(
+        "--layerwise-num-gpu-layers",
+        type=int,
+        default=1,
+        help="Number of ready layers (blocks) to keep on GPU during generation.",
+    )
     return parser.parse_args()
 
 
@@ -112,6 +123,8 @@ def main():
 
     omni = Omni(
         model=args.model,
+        enable_layerwise_offload=args.enable_layerwise_offload,
+        layerwise_num_gpu_layers=args.layerwise_num_gpu_layers,
         vae_use_slicing=args.vae_use_slicing,
         vae_use_tiling=args.vae_use_tiling,
         boundary_ratio=args.boundary_ratio,

@@ -107,6 +107,17 @@ def parse_args() -> argparse.Namespace:
         action="store_true",
         help="Enable CPU offloading for diffusion models.",
     )
+    parser.add_argument(
+        "--enable-layerwise-offload",
+        action="store_true",
+        help="Enable layerwise (blockwise) offloading on DiT modules.",
+    )
+    parser.add_argument(
+        "--layerwise-num-gpu-layers",
+        type=int,
+        default=1,
+        help="Number of ready layers (blocks) to keep on GPU during generation.",
+    )
     parser.add_argument(
         "--tensor_parallel_size",
         type=int,
@@ -172,6 +183,8 @@ def main():
 
     omni = Omni(
         model=args.model,
+        enable_layerwise_offload=args.enable_layerwise_offload,
+        layerwise_num_gpu_layers=args.layerwise_num_gpu_layers,
         vae_use_slicing=args.vae_use_slicing,
         vae_use_tiling=args.vae_use_tiling,
         cache_backend=args.cache_backend,

@@ -74,6 +74,17 @@ def parse_args() -> argparse.Namespace:
         action="store_true",
         help="Enable CPU offloading for diffusion models.",
     )
+    parser.add_argument(
+        "--enable-layerwise-offload",
+        action="store_true",
+        help="Enable layerwise (blockwise) offloading on DiT modules.",
+    )
+    parser.add_argument(
+        "--layerwise-num-gpu-layers",
+        type=int,
+        default=1,
+        help="Number of ready layers (blocks) to keep on GPU during generation.",
+    )
     parser.add_argument(
         "--ulysses_degree",
         type=int,
@@ -123,6 +134,8 @@ def main():
 
     omni = Omni(
         model=args.model,
+        enable_layerwise_offload=args.enable_layerwise_offload,
+        layerwise_num_gpu_layers=args.layerwise_num_gpu_layers,
         vae_use_slicing=args.vae_use_slicing,
         vae_use_tiling=args.vae_use_tiling,
         boundary_ratio=args.boundary_ratio,

@@ -0,0 +1,126 @@
+import sys
+from pathlib import Path
+
+import pytest
+import torch
+from vllm.distributed.parallel_state import cleanup_dist_env_and_memory
+
+from tests.utils import GPUMemoryMonitor
+from vllm_omni.inputs.data import OmniDiffusionSamplingParams
+from vllm_omni.platforms import current_omni_platform
+
+# ruff: noqa: E402
+REPO_ROOT = Path(__file__).resolve().parents[2]
+if str(REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(REPO_ROOT))
+
+from vllm_omni import Omni
+
+models = ["Wan-AI/Wan2.2-T2V-A14B-Diffusers"]
+
+
+def run_inference(
+    model_name: str,
+    layerwise_offload: bool = False,
+    num_gpu_layers: int = 1,
+    num_inference_steps: int = 3,
+) -> float:
+    # Layerwise offload currently supports CUDA GPUs only, so torch.cuda APIs are used directly.
+    # NPU / ROCm platforms are skipped by the skipif marker on each test.
+    torch.cuda.empty_cache()
+    device_index = torch.cuda.current_device()
+    monitor = GPUMemoryMonitor(device_index=device_index, interval=0.02)
+    monitor.start()
+
+    m = Omni(
+        model=model_name,
+        enable_layerwise_offload=layerwise_offload,
+        layerwise_num_gpu_layers=num_gpu_layers,
+        boundary_ratio=0.875,
+        flow_shift=5.0,
+    )
+
+    torch.cuda.reset_peak_memory_stats(device=device_index)
+
+    # Refer to tests/e2e/offline_inference/test_t2v_model.py
+    # Use minimal settings for testing
+    height = 480
+    width = 640
+    num_frames = 5
+
+    m.generate(
+        "A cat sitting on a table",
+        OmniDiffusionSamplingParams(
+            height=height,
+            width=width,
+            generator=torch.Generator("cuda").manual_seed(42),
+            guidance_scale=1.0,
+            num_inference_steps=num_inference_steps,
+            num_frames=num_frames,
+        ),
+    )
+
+    peak = monitor.peak_used_mb
+    monitor.stop()
+
+    return peak
+
+
+@pytest.mark.skipif(current_omni_platform.is_npu() or current_omni_platform.is_rocm(), reason="Hardware not supported")
+@pytest.mark.parametrize("model_name", models)
+def test_layerwise_offload_diffusion_model(model_name: str):
+    """Test that layerwise offloading reduces GPU memory usage.
+
+    This test verifies that layerwise offloading significantly reduces peak
+    GPU memory usage compared to loading the entire model on GPU. The layerwise
+    offloader keeps only a single transformer block on GPU at a time, with
+    prefetching for compute-memory overlap.
+    """
+    try:
+        # Run without layerwise offloading (baseline)
+        no_offload_peak_memory = run_inference(model_name, layerwise_offload=False)
+        cleanup_dist_env_and_memory()
+
+        # Run with layerwise offloading (1 layer on GPU)
+        layerwise_offload_peak_memory = run_inference(model_name, layerwise_offload=True, num_gpu_layers=1)
+    except Exception as e:
+        # Surface the underlying error instead of a generic failure message
+        pytest.fail(f"Inference failed: {e}")
+
+    print(f"Layerwise offload peak memory (1 GPU layer): {layerwise_offload_peak_memory} MB")
+    print(f"No offload peak memory: {no_offload_peak_memory} MB")
+
+    # Verify that layerwise offloading significantly reduces memory usage
+    # Using a threshold of 2500 MB savings to match the CPU offload test
+    assert layerwise_offload_peak_memory + 2500 < no_offload_peak_memory, (
+        f"Layerwise offload peak memory {layerwise_offload_peak_memory} MB "
+        f"should be significantly less than no offload peak memory {no_offload_peak_memory} MB"
+    )
+
+
+@pytest.mark.skipif(current_omni_platform.is_npu() or current_omni_platform.is_rocm(), reason="Hardware not supported")
+@pytest.mark.parametrize("model_name", models)
+def test_layerwise_offload_multiple_gpu_layers(model_name: str):
+    """Test layerwise offloading with multiple GPU layers.
+
+    This test verifies that keeping 2 transformer blocks on GPU results in a
+    higher peak memory usage than keeping 1 block. It does not compare
+    against the no-offload baseline.
+    """
+    try:
+        # Run with 1 GPU layer
+        one_layer_peak = run_inference(model_name, layerwise_offload=True, num_gpu_layers=1)
+        cleanup_dist_env_and_memory()
+
+        # Run with 2 GPU layers
+        two_layers_peak = run_inference(model_name, layerwise_offload=True, num_gpu_layers=2)
+    except Exception as e:
+        # Surface the underlying error instead of a generic failure message
+        pytest.fail(f"Inference failed: {e}")
+
+    print(f"Layerwise offload peak memory (1 GPU layer): {one_layer_peak} MB")
+    print(f"Layerwise offload peak memory (2 GPU layers): {two_layers_peak} MB")
+
+    # Verify that 2 GPU layers use more peak memory than 1 GPU layer
+    assert one_layer_peak < two_layers_peak, (
+        f"1 GPU layer peak {one_layer_peak} MB should be < 2 GPU layers peak {two_layers_peak} MB"
+    )
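
The tests above read `monitor.peak_used_mb` from `GPUMemoryMonitor` in `tests.utils`, whose implementation is not part of this diff. As a rough illustration of how a polling monitor of that shape could work, here is a hypothetical `PeakPoller` that samples an arbitrary callable on a background thread and keeps the maximum. The class name, its attributes, and the injected `sample_fn` are all assumptions made for this sketch.

```python
import threading
from typing import Callable, Optional


class PeakPoller:
    """Polls sample_fn on a background thread and remembers the maximum."""

    def __init__(self, sample_fn: Callable[[], float], interval: float = 0.02) -> None:
        self._sample_fn = sample_fn
        self._interval = interval
        self._stop_event = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self.peak = 0.0
        self.num_samples = 0

    def _run(self) -> None:
        while not self._stop_event.is_set():
            self.peak = max(self.peak, self._sample_fn())
            self.num_samples += 1
            # Sleep for one interval, but wake up early if stop() is called.
            self._stop_event.wait(self._interval)

    def start(self) -> None:
        self._stop_event.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join()
            self._thread = None
```

A polling monitor only sees memory usage at its sampling instants, so spikes shorter than the polling interval can be missed. That is one reason to assert on a comfortable margin (such as the 2500 MB used above) rather than on exact values.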
@@ -288,6 +288,12 @@ class OmniDiffusionConfig:
     # - Text encoders run on GPU while DiT is on CPU
     # - DiT runs on GPU while encoders are on CPU
     enable_cpu_offload: bool = False
+
+    # Layer-wise offloading (block-level offloading) parameters
+    enable_layerwise_offload: bool = False
+    # Number of transformer blocks ready for computation to keep on GPU
+    layerwise_num_gpu_layers: int = 1
+
     use_fsdp_inference: bool = False
     pin_cpu_memory: bool = True  # Use pinned memory for faster transfers when offloading
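
The documentation in this PR states that the two offloading strategies cannot be enabled together, but this hunk does not show where that is enforced. One possible way to express the constraint is a `__post_init__` check, sketched below on a hypothetical stand-in dataclass. `OffloadOptions` is not `OmniDiffusionConfig`, and the error messages and the lower bound on `layerwise_num_gpu_layers` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class OffloadOptions:
    """Hypothetical stand-in that mirrors the three offload-related fields."""

    enable_cpu_offload: bool = False
    enable_layerwise_offload: bool = False
    layerwise_num_gpu_layers: int = 1

    def __post_init__(self) -> None:
        # The docs say the two strategies cannot be combined for now.
        if self.enable_cpu_offload and self.enable_layerwise_offload:
            raise ValueError(
                "enable_cpu_offload and enable_layerwise_offload "
                "cannot be used at the same time"
            )
        # Assumed constraint: at least one block must be kept ready on GPU.
        if self.enable_layerwise_offload and self.layerwise_num_gpu_layers < 1:
            raise ValueError("layerwise_num_gpu_layers must be >= 1")
```

Failing at construction time keeps a misconfiguration from surfacing later as an out-of-memory error or a device mismatch during inference.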
 

@@ -758,6 +758,7 @@ class QwenImageTransformer2DModel(CachedTransformer):
     # -- typically a transformer layer
     # used for torch compile optimizations
     _repeated_blocks = ["QwenImageTransformerBlock"]
+    _layerwise_offload_blocks_attr = "transformer_blocks"
     packed_modules_mapping = {
         "to_qkv": ["to_q", "to_k", "to_v"],
         "add_kv_proj": ["add_q_proj", "add_k_proj", "add_v_proj"],

@@ -531,6 +531,7 @@ class WanTransformer3DModel(nn.Module):
     """
 
     _repeated_blocks = ["WanTransformerBlock"]
+    _layerwise_offload_blocks_attr = "blocks"
     packed_modules_mapping = {
         "to_qkv": ["to_q", "to_k", "to_v"],
     }
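
The `_layerwise_offload_blocks_attr` class attribute added to both transformer classes names the attribute that holds the transformer blocks. The lookup code itself is not shown in this diff, so the sketch below is only one plausible way an offloader could resolve the blocks. `find_offload_blocks` and `ToyWanModel` are hypothetical names, and plain Python lists stand in for `nn.ModuleList`.

```python
from typing import Any, Sequence


def find_offload_blocks(model: Any) -> Sequence[Any]:
    """Resolve the block container named by the model's class attribute."""
    attr_name = getattr(type(model), "_layerwise_offload_blocks_attr", None)
    if attr_name is None:
        raise ValueError(
            f"{type(model).__name__} does not define "
            "_layerwise_offload_blocks_attr; layerwise offload is unsupported"
        )
    blocks = getattr(model, attr_name, None)
    if blocks is None:
        raise ValueError(
            f"{type(model).__name__} has no attribute '{attr_name}'"
        )
    return blocks


class ToyWanModel:
    """Toy model that mimics the attribute layout of WanTransformer3DModel."""

    _layerwise_offload_blocks_attr = "blocks"

    def __init__(self) -> None:
        self.blocks = ["block0", "block1", "block2"]
```

Declaring the attribute name on the class, rather than hard-coding `blocks` or `transformer_blocks` in the offloader, lets each model opt in with a single line, as the two hunks above do.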